(57) Abstract

An input utterance containing a command word to be recognized is processed (110) and features which adequately represent
the utterance are determined. Prestored features of a set of reference samples of command words (160) are compared (170) to
the features of the input utterance. Recognition of command words in noisy environments is improved by determining the
distance between the features of the input utterance and the features of the reference samples and modifying the distance (120) in
response to background noise. The reference sample having the minimum distance is selected as the recognized command word.
METHOD AND APPARATUS FOR RECOGNIZING
COMMAND WORDS IN NOISY ENVIRONMENTS



Technical Field 

This invention relates generally to the field of word
recognizers and in particular to those word recognizers which are
capable of recognizing command words in noisy environments.

Background 

Traditionally, the interaction between humans and devices
has been achieved by some form of manual interaction, such as
activating a switch or pushing a button. However, in many
instances it may be advantageous or even necessary to interface
with the device by means of a voice command. For example, a
policeman in a police car may activate numerous functions, such
as turning on the siren, by simply uttering an appropriate
utterance which contains a word command. A word recognizer,
after receiving and processing the utterance, recognizes the word
command and effectuates the desired function.

Generally, a word recognizer recognizes the word
command by extracting features which adequately represent the
utterance, and making a decision as to whether these features
meet particular criteria. These criteria may comprise
correspondence to a set of pre-stored features representing the
command words to be recognized.

The word recognizer may be speaker dependent or
speaker independent. A speaker independent word recognizer is
designed to recognize the commands of potentially any number
of users regardless of the differences in speech patterns, accents,
and other variations in spoken words. However, the speaker
independent word recognizer requires significantly sophisticated
processing capability and hence has been constrained to
recognizing a limited number of command words.

A speaker dependent word recognizer is designed to
recognize the command words of a limited number of users by
comparing the utterance to prestored voice templates which
contain the voice features of those users. Therefore, it is
necessary to train the word recognizer to recognize the voice
features of each individual user. Training is commonly
understood to be a process by which each individual user repeats
a predetermined set of word commands a sufficient number of
times so that an acceptable number of voice features are
extracted and stored as reference features.

One of the important characteristics of a word recognizer is
its capability to accurately recognize a word command under
various noise conditions. Typically, word recognizers
provide error rates of less than 1% in quiet environments.
However, the error rate may be degraded by as much as 40% in
environments where there is a 20 dB peak signal-to-noise ratio
(SNR). One of the factors contributing to poor noise performance
is the difference between the training condition under which the
reference features are derived and the operating condition under
which the utterance features are derived. Accordingly, due to this
difference, comparison of the reference features and input
utterance features may produce substantially erroneous results.

Many word recognizers incorporate noise compensation
techniques in the means utilized to derive the reference features. In
one such word recognizer, a background noise estimator
provides the ambient noise characteristics, and the prestored
reference features are temporarily modified according to the
characteristics of the ambient noise. The modified reference
features and the input utterance features are then compared to
each other, and the reference sample having features with the
closest similarity to the features of the input utterance is declared
as the recognized word.

In another type of word recognizer, the features of the
input utterance are represented by the amount of energy
contained within a predetermined number of frequency bands. This
technique is known as the filter banks method. In the word
recognizer utilizing this technique, noise compensation is
achieved by determining the background noise energy at every
frequency band and subtracting it from the energy at the
corresponding frequency band of the input utterance. The
resulting features are then compared to the corresponding
reference features, and again the reference sample having the most
similar features to the features of the input utterance is declared
the recognized word. However, this type of system suffers an
inherent drawback in that the number of predetermined
frequency bands is critical to the proper operation of the word
recognizer. That is, dividing the voice spectrum into a high
number of frequency bands causes degradation in recognition
accuracy of high pitched voices, and dividing the voice spectrum
into a low number of frequency bands causes a smearing effect on
the voice signal.

Other means of noise compensation in speech recognition
utilize noise reduction techniques, wherein the signal to noise
ratio is increased using various filtering techniques. However,
practical improvements in SNR typically fall short of achieving
substantial accuracy in recognizing word commands. Another
method of noise compensation for a system utilized in a severe
noise environment is to train the system in a comparable noise
environment. However, certain types of noise, such as acoustical
background noise, are time variant in nature. Accordingly, it is not
possible to predict or otherwise reproduce, during training, the
actual time variant noise which will exist during a subsequent
speech recognition mode.
Summary of the Invention

Accordingly, it is an object of the present invention to
provide a word recognizer apparatus capable of accurately
recognizing command words under various noise conditions.

Briefly, the word recognizer of the invention comprises a
voice processing means for receiving an input utterance and
determining features which adequately represent the utterance.
A template means provides the pre-stored features of a set of
reference samples which represent the recognizable command
words. A noise analysis means determines ambient noise
characteristics. A comparison means determines the distance
between the features of the utterance and the reference samples.
The comparison means is responsive to the ambient noise
characteristics for modifying the determined distance. The word
recognizer apparatus includes means for determining the
minimum distance and selecting the reference sample based
thereon.



Brief Description of the Drawings 

Figure 1 shows a block diagram of the word recognizer of
the invention.

Figure 2 shows a block diagram of the voice processor
shown in Figure 1.

Figure 3 is the flow chart for extracting CSM features of an
input utterance.

Figure 4 shows the block diagram of the noise analyzer of
Figure 1.

Figure 5 shows a portion of the word recognizer of the
invention which includes the block diagram of the template
means of Figure 1.

Figure 6 shows the graph of the power distribution of the
reference sample and the input utterance for a command word.

Figure 7 shows a portion of the word recognizer of the
invention which includes the block diagram of the comparison
means of Figure 1.

Figure 8 is the flow chart of the steps taken according to
the invention to recognize the word command in noisy
environments.



Detailed Description of the Preferred Embodiment

Referring to FIG. 1, a block diagram is shown of a word
recognizer 100 which utilizes the principles of the present
invention for recognizing word commands. The word recognizer
100 comprises an isolated word recognizer which is capable of
recognizing more than one spoken word command having a
pause therebetween. The word recognizer 100 includes a voice
processor 110 for processing an input utterance containing one
or more word commands. The input utterance is received through
a microphone 103 which produces a voice signal representing
the input utterance. A well known audio filter 105 is used to limit
the frequency spectrum of the input utterance to a predetermined
range. In the preferred embodiment of the invention, the range of
the audio filter 105 is confined to a range of 200 Hz to 3200 Hz.
The voice processor 110 divides the input utterance into frames
of predetermined duration. The voice processor 110 provides, in
each frame, those features of the input utterance which
adequately characterize the input utterance. The detailed process
by which these features are produced is described later. These
features comprise frequency components and corresponding
amplitudes as well as the power of the input utterance in each
frame. A background noise analyzer 120 provides the
characteristics of the ambient noise. These characteristics
comprise the signal to noise ratio in the frequency spectrum and the
level of the ambient noise floor. Because the word recognizer
100 is an isolated word recognizer, the beginning and the end of
the input utterance must be determined. In the preferred
embodiment of the invention, this determination is made by
comparing the power of the input utterance to the power of the
ambient noise floor. When the power of the input utterance
exceeds the ambient noise floor, a comparator 130 closes a
switch 140, thereby allowing the features of the input utterance to
be stored in a temporary feature storage means 150. When the
power of the input utterance falls below the ambient noise floor,
the switch 140 is opened, preventing features from being stored
in the storage means 150. Accordingly, the end points of the
input utterance are determined by comparing the ambient noise
floor to the power of the input utterance. A template means 160
provides the features of a set of prestored reference samples.
The features of the prestored reference samples are generated,
during training, utilizing the same process as that which provides
the features of the input utterance. As subsequently described
herein, the template means 160 aligns the end points of the
reference sample with the end points of the input utterance. A
comparison means 170, primarily comprising a
microcomputer/controller, provides the distance between the
features of the input utterance and the reference samples. The
detail of the process by which the distance between the features
of the input utterance and the reference sample is produced is
described later. The comparison means 170 then selects the
reference sample having the minimum distance from the features
of the input utterance and based thereon declares the word
command. Noise compensation in the word recognizer of the
invention is achieved by eliminating or modifying the distance
between the features of the input utterance and the features of the
reference sample having noise characteristics above a
predetermined threshold.
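
The end point determination performed by the comparator 130 and the switch 140 can be illustrated with a minimal sketch. It assumes one power value per frame and a single noise floor value; the function name `detect_endpoints` and the sample numbers are hypothetical and are not taken from the specification.

```python
def detect_endpoints(frame_powers, noise_floor):
    """Return (start, end) frame indices of the first run of frames whose
    power exceeds the noise floor, or None if no frame exceeds it."""
    start = None
    for n, power in enumerate(frame_powers):
        if power > noise_floor:
            if start is None:
                start = n            # "switch closed": begin storing features
        elif start is not None:
            return start, n - 1      # "switch opened": the utterance has ended
    if start is not None:
        return start, len(frame_powers) - 1
    return None


# Illustrative frame powers (made-up values) and noise floor.
powers = [0.1, 0.2, 3.0, 5.0, 4.0, 0.3, 0.1]
print(detect_endpoints(powers, noise_floor=0.5))  # (2, 4)
```

Only the frames between the returned indices would be passed on to the temporary feature storage means in this sketch.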

Referring to FIG. 2, the block diagram of the voice
processor 110 comprises an A/D converter 102 which samples
the voice signals provided by microphone 103 of FIG. 1 at a
suitable sampling rate, such as 8000 samples per second. A
frame buffer 104 buffers the sampled signal and provides frames
which consist of a predetermined number of consecutive voice
samples. The framing technique utilized by the frame buffer 104
is well known in the art, and the frames provided by the preferred
embodiment of the invention comprise 160 samples which
correspond to a frame duration of 20 msec. It may be appreciated
that depending on the duration of each input utterance a variable
number of frames (designated as N) may be generated by the
frame buffer 104.
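
The framing arithmetic above (160 samples at 8000 samples per second giving 20 msec) can be sketched as follows. The sketch assumes non-overlapping frames and drops an incomplete final frame; both choices are assumptions, since the specification only states that frames consist of consecutive samples. The name `split_into_frames` is hypothetical.

```python
SAMPLE_RATE_HZ = 8000
FRAME_LEN = 160  # 160 samples / 8000 samples per second = 20 msec


def split_into_frames(samples, frame_len=FRAME_LEN):
    """Split a sample sequence into consecutive, non-overlapping frames.

    Assumption: an incomplete trailing frame is discarded."""
    n_frames = len(samples) // frame_len
    return [samples[k * frame_len:(k + 1) * frame_len] for k in range(n_frames)]


# Half a second of silence yields N = 25 frames of 20 msec each.
frames = split_into_frames([0.0] * 4000)
print(len(frames), FRAME_LEN / SAMPLE_RATE_HZ)  # 25 0.02
```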

The features characterizing each frame utterance may be
parametric or discrete. The discrete features of the utterance
frames may be provided by such known techniques as the filter
banks method. The embodiment of the present invention utilizes
a technique which provides the parametric features of the
utterance frame. The parametric features of the utterance may be
provided by such known techniques as linear predictive analysis
(LPC) or composite sinusoidal modeling (CSM). In the preferred
embodiment of the invention, the features of the utterance frames
are provided utilizing conventional CSM analysis techniques as
described in S. Sagayama and F. Itakura, "Duality Theory of
Composite Sinusoidal Modelling and Linear Prediction", ICASSP
'86 Proceedings, vol. 3, pp. 1261-1264, the disclosure of which is
hereby incorporated by reference. The purpose of CSM analysis
is to determine a set of CSM features which adequately
characterize the frame utterance. The CSM features comprise
CSM frequencies { fj } and amplitudes { mj } which correspond
thereto. The number of CSM features (designated as M) of each
frame of the input utterance is related to the frequency range of
the utterance. In utterances confined to a range of 200 Hz to
3200 Hz in frequency spectrum, there usually exist four formant
resonant frequencies below 3200 Hz. Thus, it is usually sufficient
to utilize 4 CSM frequencies and amplitudes to characterize the
input utterance frames. Therefore, in the preferred embodiment of
the invention, the number of features (designated as M) is equal
to 4. A feature extractor 106 executes a feature extraction
process utilizing conventional CSM techniques, as shown
in the flow chart of FIG. 3.

According to FIG. 3, the CSM extractor 106, at block 310,
receives the input utterance frames and computes the
autocorrelation of the frames at block 320. The term of
the interpolative correlation is then computed, block 330. At block
340, the CSM extractor 106 solves a Hankel matrix for providing
the coefficients of a polynomial:

    $p_n + \cdots + p_1 x^{n-1} + x^n = 0$    (1).

Then the real roots { xj } of equation (1) are provided, block
350. The amplitude CSM features, { mj }, are provided by the
matrix shown in block 360, and the frequency CSM features,
{ fj }, are provided by the following equation:

    $\{ f_j \} = \cos^{-1} x_j$    (2).
In addition to amplitude and frequency CSM features, the feature
extractor 106 also provides the power content of each input
utterance frame, derived from the following equation:

    $P(n) = (1/N) \sum_{i=1}^{N} \{T(i)\}^2$    (3).

Accordingly, the features of the input utterance for each frame
may be represented by a composite vector:

    $T(n) = \{ m_1^n, m_2^n, \ldots, m_M^n, f_1^n, f_2^n, \ldots, f_M^n, P(n) \}$    (4)

and the entire utterance may be represented by

    $\{ T_1, T_2, \ldots, T_N \}$    (5).
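
Equations (2) to (4) can be illustrated with a short sketch. It does not perform CSM analysis itself: the roots and amplitudes are taken as given inputs, and the function names (`csm_frequencies`, `frame_power`, `composite_vector`) and example values are hypothetical. The frequencies returned by equation (2) are treated here as angles in radians, which is an assumption about units.

```python
import math


def csm_frequencies(roots):
    """Equation (2): f_j = arccos(x_j) for each real root x_j in [-1, 1]."""
    return [math.acos(x) for x in roots]


def frame_power(frame):
    """Equation (3): mean of the squared sample values of one frame."""
    return sum(s * s for s in frame) / len(frame)


def composite_vector(amplitudes, frequencies, power):
    """Equation (4): amplitudes, then frequencies, then the frame power."""
    return list(amplitudes) + list(frequencies) + [power]


# Illustrative values only; real roots and amplitudes come from CSM analysis.
freqs = csm_frequencies([1.0, 0.0, -1.0])      # [0, pi/2, pi]
power = frame_power([1.0, -1.0, 1.0, -1.0])    # 1.0
print(composite_vector([0.5, 0.3, 0.2], freqs, power))
```

An utterance of N frames would then be a list of N such vectors, matching expression (5).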

One of ordinary skill in the art may appreciate that the voice
processor 110 described in FIG. 2 and in FIG. 3 may be
implemented by means of any suitable digital signal processor
(DSP), such as the 56000 series family of DSPs manufactured by
Motorola, Inc.

WO 91/11696

PCT/US91/00053

Referring to FIG. 4, the block diagram of a well known
noise analyzer 120 is shown. The noise analyzer 120
continually monitors the background noise and provides
characteristics thereof. The noise analyzer 120 includes a noise
processing means 122 for producing the noise powers of the
desired frequency spectrum. The noise processing means 122
utilizes well known analysis techniques, such as Fast Fourier
Transformation analysis, to provide noise power at desired CSM
frequencies. The noise processor 122 also receives the
corresponding CSM amplitudes of the input utterance frames and
produces the signal to noise ratios SNR(f) at the CSM
frequencies. Additionally, the noise analyzer 120 includes a well
known noise averaging means 124 which provides the power at
noise floor Rn. The techniques for providing the ambient noise floor
are well known in the art.
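
A rough sketch of the two outputs of the noise analyzer, SNR(f) and Rn, is given below under several stated assumptions. The specification names Fast Fourier Transformation analysis; this sketch instead evaluates a direct single-frequency DFT, which gives the power at one chosen frequency without an FFT library. The way the CSM amplitude is combined with the noise power (amplitude squared over noise power, in dB) and the exponential averaging used for the noise floor are assumptions, not details taken from the text. All function names are hypothetical.

```python
import cmath
import math


def power_at_frequency(samples, freq_hz, sample_rate_hz=8000):
    """Power of `samples` at one frequency, via a direct single-bin DFT."""
    n = len(samples)
    w = -2j * math.pi * freq_hz / sample_rate_hz
    acc = sum(s * cmath.exp(w * k) for k, s in enumerate(samples))
    return abs(acc / n) ** 2


def snr_db(csm_amplitude, noise_power):
    """Assumed SNR definition: squared CSM amplitude over noise power, in dB."""
    return 10.0 * math.log10(csm_amplitude ** 2 / noise_power)


def update_noise_floor(previous_floor, frame_power, alpha=0.9):
    """Assumed averaging rule: exponential average of noise frame powers."""
    return alpha * previous_floor + (1.0 - alpha) * frame_power


# A 1000 Hz tone of amplitude 2 over one 160-sample frame (illustrative).
tone = [2.0 * math.cos(2 * math.pi * 1000 * k / 8000) for k in range(160)]
print(power_at_frequency(tone, 1000))
print(snr_db(10.0, 1.0))
```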

Referring to FIG. 5, the block diagram of the template
means 160, which in the preferred embodiment of the invention
operates under the control of the comparison means 170, is
shown. The template storage means 162 stores the features of a
set of reference samples representing word commands
recognizable by the word recognizer 100. These reference
features have been obtained during a training process. During
the training process, a user repeats each of the desired word
commands to be recognized a number of times. Preferably, the
training of the word recognizer is performed in a quiet
environment. The features of the user's voice are extracted and
stored in the template storage means 162 as the reference
samples. During training, the utterances are processed identically
to the processing of the input utterance. In fact, the voice
processor 110 is used to generate the reference sample features
during training of the word recognizer 100. It may be appreciated
that the number of reference sample frames (designated as J)
may be different from the number of the corresponding input
utterance frames N. It should be noted that the powers of each
frame as derived from equation (3) are also included in the
features of the reference sample. Accordingly, the features of
each reference sample may be stored in the template storage
means 162 as vectors:
    $R(j) = \{ m_1^j, m_2^j, \ldots, m_M^j, f_1^j, f_2^j, \ldots, f_M^j, P(j) \}$    (6)

and all of the recognizable reference samples are
represented as:

    $W_1 = \{ R^1_1, \ldots, R^1_{J(1)} \}$    (7)

    ...

    $W_K = \{ R^K_1, \ldots, R^K_{J(K)} \}$    (8)

where
J(1), ..., J(K) = number of frames in the reference
samples, and
K = number of reference samples.
In operation and under the control of comparison means
170, each of these reference samples is selected and compared
to the input utterance. In order to achieve an effective
comparison, the end points of the reference sample under
comparison and the input utterance must be aligned. However,
because the features of the reference sample are generated in a
quiet environment and these same features are used for
comparison under noisy conditions, the end points of the
reference sample and the input utterance may become
misaligned. In the preferred embodiment of the invention, an end
point aligner 164 is included in the template means 160 to
alleviate end point misalignments. FIG. 6 shows in the time domain
the power contour 610 of a reference sample for a word
command. It may be appreciated that the power contour 610 of
the reference sample can actually be represented by a number of
discrete powers corresponding to each frame. However, for the
sake of simplicity and ease of understanding, the contour of the
power distribution of the reference sample is shown as a solid
line 610. Similarly, the power contour of an input utterance
substantially corresponding to that of the reference sample is
shown by a dotted line 620. As shown, the end points of the
reference sample in quiet background and the input utterance in
noisy environments are separated from each other by the
ambient noise floor power Rn. It may be appreciated that if the
end points of the reference sample are readjusted by a number
of frames such that the subsequent frames have powers above
the noise floor power, the end points of the reference sample and
the input utterance may be realigned. Therefore, the noise floor
power Rn provided by noise analyzer 120 constitutes a threshold
by which the end points of the reference sample are readjusted.
Referring back to FIG. 5, the end point aligner 164 skips those
candidate end points whose powers are below the noise power.
One of ordinary skill in the art appreciates that the end point
aligner 164 may be implemented by means of any suitable
microcomputer or DSP executing a suitable program for
achieving the intended purpose thereof.
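
One possible reading of the end point readjustment is sketched below: leading and trailing reference frames whose stored power lies below the noise floor Rn are skipped, so that the reference sample starts and ends where the noisy utterance would. The function name `align_endpoints` and the sample powers are hypothetical; the specification leaves the actual program to the implementer.

```python
def align_endpoints(reference_powers, noise_floor):
    """Return (first, last) indices of the reference frames kept after
    skipping leading and trailing frames whose power is below the noise
    floor, or None if every frame is below it."""
    first = 0
    last = len(reference_powers) - 1
    while first <= last and reference_powers[first] < noise_floor:
        first += 1               # skip quiet frames at the start
    while last >= first and reference_powers[last] < noise_floor:
        last -= 1                # skip quiet frames at the end
    return (first, last) if first <= last else None


# Reference trained in quiet; operating noise floor Rn = 1.0 (made-up values).
ref_powers = [0.2, 0.8, 4.0, 6.0, 3.0, 0.9, 0.1]
print(align_endpoints(ref_powers, noise_floor=1.0))  # (2, 4)
```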

Referring to FIG. 7, the comparison means 170 comprises a
well known microcomputer/controller, such as the 68000 family
of microcomputers manufactured by Motorola, Inc. The
comparison means 170, among other things, includes a
controller 172, a computer 174, a RAM 176 and a ROM 178. The
controller 172 performs several functions which include
controlling the operation of the comparison means 170 and the
template means 160 as well as interacting with the temporary
storage means 150 and noise analyzer 120. The computer 174
performs the computational functions of the comparison means
170. The RAM 176 provides temporary information storage for
the computer 174 and the controller 172. The program containing
the operational steps of the computer 174 and the controller 172
is stored in the ROM 178.

The operation of the comparison means is described in
conjunction with the flow chart shown in FIG. 8. At block 810, the
controller 172 receives the features of the input utterance from the
temporary storage means 150. At block 820, the features of the
first reference sample after endpoint alignment are received from
the template means 160. At block 830, the computer 174
determines the distance between the features of the reference
sample and the input utterance. In the preferred embodiment of
the invention, only the frequency features of the frames of the
utterance and the reference sample are utilized for computing the
distance. The determined distance is called a local distance
metric and is computed from the following equation:



    $d(n,j) = \sum_{i=1}^{M} (T(i,n) - R(i,j))^2$    (9)

where

    1 <= i <= M is the index of composite sinusoidal features,
    1 <= n <= N is the time index of the utterance,
    1 <= j <= J is the time index of the reference sample,
    T(i,n) represents the i th composite sinusoidal
    frequency in the n th frame of said utterance,
    R(i,j) represents the i th composite sinusoidal
    frequency in the j th frame of said reference sample.
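
Equation (9) is a sum of squared differences over the M frequency features of one utterance frame and one reference frame. A minimal sketch follows; the function name `local_distance` and the frequency values are hypothetical.

```python
def local_distance(t_frame, r_frame):
    """Equation (9): sum over i of (T(i,n) - R(i,j))^2 for one frame pair.

    Both arguments hold the M CSM frequencies of a single frame."""
    return sum((t - r) ** 2 for t, r in zip(t_frame, r_frame))


# M = 4 illustrative CSM frequencies (Hz) for one utterance and one reference frame.
t_n = [500, 1500, 2500, 3000]
r_j = [520, 1400, 2500, 3100]
print(local_distance(t_n, r_j))  # 400 + 10000 + 0 + 10000 = 20400
```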
After the local distance for each frequency feature of every frame
is calculated, the local distance metric d is modified by a function
W(i,n) of the signal to noise ratio SNR(f) provided by the noise
analyzer 120. The function W(i,n) may be defined as:

    $W(i,n) = F\{\mathrm{SNR}(f)\}$    (10).

Therefore the modified local distance may be defined as:

    $d = \sum_{i=1}^{M} (T(i,n) - R(i,j))^2 \, W(i,n) / K$    (11)

where K is the normalization constant defined by:

    $K = \sum_{n=1}^{N} \sum_{i=1}^{M} W(i,n)$.
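
The modified distance of equation (11) and its normalization constant K can be sketched as below, with the weights W(i,n) supplied as a table of N rows (frames) by M columns (features). The function names are hypothetical, and the handling of K = 0 (every feature weighted out) is an assumption, since the text does not address that case.

```python
def normalization_constant(weights):
    """K: sum of W(i,n) over all frames n and all features i."""
    return sum(sum(row) for row in weights)


def modified_local_distance(t_frame, r_frame, w_frame, k):
    """Equation (11): weighted squared differences for one frame pair, over K."""
    if k == 0:
        # Assumption: with every weight zero the distance is undefined.
        raise ValueError("all weights are zero; K is zero")
    return sum(w * (t - r) ** 2 for t, r, w in zip(t_frame, r_frame, w_frame)) / k


# Two utterance frames, M = 4; one noisy feature in frame 1 is weighted out.
weights = [[1, 0, 1, 1], [1, 1, 1, 1]]
k = normalization_constant(weights)            # 7
d = modified_local_distance([500, 1500, 2500, 3000],
                            [520, 1400, 2500, 3100],
                            weights[0], k)
print(k, d)                                    # 7 and 10400 / 7
```

In this example the second feature, which differs by 100 Hz, contributes nothing because its weight is zero.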

In the preferred embodiment of the invention W(i,n) comprises a
discrete function defined by:

    $W(i,n) = \begin{cases} 1 & \text{if } \mathrm{SNR}(f) > N.T. \\ 0 & \text{otherwise} \end{cases}$    (12)



where N.T. = signal to noise ratio threshold. Accordingly, the i th
frequency feature of the n th frame is eliminated if the SNR(f) at
that frequency is below the SNR threshold. Alternatively, W(i,n) may
comprise a continuously differentiable limiting function, such as the
well known sigmoidal or hyperbolic tangent functions. It may be
appreciated that for each frame of the input utterance there is a total
of at most J local distances. The minimum legal local distance of
each input utterance frame is added to subsequent local
distances. An accumulated distance is thus determined for each
reference sample frame, block 840. One of ordinary skill in the art
may appreciate that under predetermined boundary and
continuity conditions a minimum distance may be obtained
utilizing well known dynamic time warping techniques. One such
technique is described in F. Itakura, "Minimum Prediction
Residual Principle Applied to Speech Recognition", IEEE
Transactions on Acoustics, Speech, and Signal Processing, vol.
ASSP-23, no. 1, pp. 67-72, February 1975, which is hereby
incorporated by reference. At block 850, the minimum distance
utilizing such a technique is computed and stored. In block 860, a
decision is made to determine whether more reference samples
are to be processed. After comparing all of the stored reference
samples, block 870, the reference sample having the minimum
distance is selected. In block 880, the command word contained
in the input utterance is recognized based on a decision on the
selected reference sample. The decision also takes into
consideration predetermined criteria before the recognized
command word is declared. Such criteria may comprise a
threshold minimum distance below which the recognized word is
valid. These predetermined criteria prevent an invalid input
utterance, which nevertheless produces a minimum distance, from
being declared the recognized word command.
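
Blocks 840 to 880 can be sketched with a basic dynamic time warping loop. This is not the Itakura formulation cited above: it uses a plain three-way step pattern with no slope constraints, chosen only to keep the illustration short. The names `dtw_distance` and `recognize`, the `templates` dictionary and the rejection threshold value are all hypothetical.

```python
import math


def dtw_distance(utterance, reference, dist):
    """Minimum accumulated distance between two frame sequences, using a
    basic step pattern (match, insertion, deletion)."""
    n, j = len(utterance), len(reference)
    acc = [[math.inf] * (j + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for a in range(1, n + 1):
        for b in range(1, j + 1):
            best_prev = min(acc[a - 1][b], acc[a][b - 1], acc[a - 1][b - 1])
            acc[a][b] = dist(utterance[a - 1], reference[b - 1]) + best_prev
    return acc[n][j]


def recognize(utterance, templates, dist, reject_threshold):
    """Pick the template with the minimum distance; return None when even
    the best distance is not below the rejection threshold."""
    best_word, best = None, math.inf
    for word, reference in templates.items():
        d = dtw_distance(utterance, reference, dist)
        if d < best:
            best_word, best = word, d
    return best_word if best < reject_threshold else None


def squared(t, r):
    return sum((x - y) ** 2 for x, y in zip(t, r))


# One feature per frame, made-up values; "siren" is a time-stretched match.
templates = {"siren": [[1.0], [2.0], [2.0], [3.0]],
             "lights": [[5.0], [5.0], [5.0]]}
print(recognize([[1.0], [2.0], [3.0]], templates, squared, reject_threshold=10.0))
print(recognize([[9.0]], templates, squared, reject_threshold=10.0))
```

The first call selects "siren" with an accumulated distance of zero; the second returns None because the closest template is still farther than the threshold, which mirrors the rejection of invalid utterances described above.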

As described, during the recognition process, the local
distances between the features of the input utterance and the
reference sample are relied upon in recognizing the command
word. The local distances are modified as a function of the signal
to noise ratio. Accordingly, the accuracy of the word recognizer
under severe noise conditions is improved by eliminating or
lessening the contribution of those local distances which have an
undesirable noise characteristic.
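
The difference between eliminating and lessening a contribution can be seen by comparing the discrete weight of equation (12) with a smooth alternative. The sketch below writes the smooth weight as a logistic curve expressed through the hyperbolic tangent; the `slope` parameter and the function names are assumptions made for illustration.

```python
import math


def hard_weight(snr, threshold):
    """Equation (12): 1 if the SNR exceeds the threshold, 0 otherwise."""
    return 1.0 if snr > threshold else 0.0


def soft_weight(snr, threshold, slope=1.0):
    """Smooth limiting function centered on the threshold.

    Uses the identity sigmoid(z) = 0.5 * (1 + tanh(z / 2)); `slope` is a
    hypothetical tuning parameter controlling how sharp the transition is."""
    return 0.5 * (1.0 + math.tanh(0.5 * slope * (snr - threshold)))


for snr in (0.0, 9.0, 10.0, 11.0, 20.0):
    print(snr, hard_weight(snr, 10.0), round(soft_weight(snr, 10.0), 3))
```

With the hard weight a feature just below the threshold is discarded entirely, whereas the soft weight only reduces its influence.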
Claims

1. An apparatus for recognizing command words in an
utterance comprising:

voice processing means for determining features
representing said utterance;

template means for providing features of a plurality
of reference samples representing command words to be
recognized by said apparatus;

noise analysis means for determining ambient noise
characteristics;

comparison means including means for determining
the distances between features of said utterance and each
reference sample being responsive to said ambient noise
characteristics for modifying the determined distance,
decision means including means for determining the
minimum modified distance, and means for recognizing the
command word based on said minimum distance.




2. The apparatus of claim 1, wherein said features of said
utterance and said reference samples are parametric features.

3. The apparatus of claim 2, wherein said template means
comprises means for aligning said utterance and said reference
sample end points.

4. The apparatus of claim 2, wherein said parametric
features include frequencies and corresponding amplitudes for
said frequencies.

5. The apparatus of claim 4, wherein said ambient noise 
characteristics include signal to noise ratio at said frequencies. 


6. The apparatus of claim 5, wherein said comparison
means is responsive to said ambient noise characteristics for
modifying said distance as a function of said signal to noise ratio.

7. The apparatus of claim 6, wherein said distance is
modified when said signal to noise ratio exceeds a
predetermined threshold.

8. The apparatus of claim 6, wherein said function for
modifying said distance comprises a continuously differentiable
limiting function.
9. The apparatus of claim 6, wherein said distance
comprises a local distance metric defined by:

    $d = \sum_{i=1}^{M} (T(i,n) - R(i,j))^2$

where

    1 <= i <= M is the index of composite sinusoidal features,
    1 <= n <= N is the time index of the utterance,
    1 <= j <= J is the time index of the reference sample,
    T(i,n) represents the i th composite sinusoidal
    features in the n th frame of said utterance,
    R(i,j) represents the i th composite sinusoidal
    features in the j th frame of said reference sample.



10. The apparatus of claim 9, wherein said recognition
means includes means for determining the minimum distance by
utilizing a dynamic time warping technique.
11. A method for recognizing command words in an
utterance comprising:

a) determining features of said utterance;

b) providing features of a plurality of reference
samples representing command words to be recognized;

c) determining characteristics of ambient noise;

d) determining the distance between features of
said utterance and features of each of said plurality of reference
samples;

e) modifying the distance between features of said
utterance and features of each of said plurality of reference
samples in response to said characteristics of ambient noise;

f) determining the minimum of the modified
distance; and

g) recognizing the command word based on the
minimum distance.
12. The method of claim 11, wherein said step (a)
comprises determining parametric features of said utterance and
step (b) comprises providing parametric features of reference
samples.

13. The method of claim 12, wherein said step (b)
includes aligning of said reference sample and said utterance
end points.

14. The method of claim 12, wherein step (a) includes
determining frequencies and corresponding amplitudes for said
frequencies and step (b) includes providing composite sinusoidal
frequencies and amplitudes at said frequencies.

15. The method of claim 14, wherein said step (c) includes
determining signal to noise ratio at said frequencies.

16. The method of claim 15, wherein said step (e)
comprises modifying the distance between features of said
utterance and features of each of said plurality of reference
samples as a function of signal to noise ratio.

17. The method of claim 16, wherein said step (e)
comprises modifying the distance between features of said
utterance and features of each of said plurality of reference
samples when the signal to noise ratio exceeds a predetermined
ratio.

18. The method of claim 16, wherein said step (e)
comprises modifying the distance between features of said
utterance and features of each of said plurality of reference
samples when the signal to noise ratio exceeds a predetermined
ratio.
19. The method of claim 25, wherein said step (d) of
determining the distance between composite sinusoidal features
of said utterance and composite sinusoidal features of each of
said plurality of reference samples is derived from a local
distance metric function defined by:

    $d = \sum_{i=1}^{M} (T(i,n) - R(i,j))^2$

where

    1 <= i <= M is the index of composite sinusoidal features,
    1 <= n <= N is the time index of the utterance,
    1 <= j <= J is the time index of the reference sample,
    T(i,n) represents the i th composite sinusoidal
    features in the n th frame of said utterance,
    R(i,j) represents the i th composite sinusoidal
    features in the j th frame of said reference sample.

20. The method of claim 16, wherein said step (f) 
comprises determining the minimum of the modified distance by 
utilizing a dynamic time warping technique. 
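
A dynamic time warping accumulation of the kind referred to in claim 20 may be sketched as follows. The sketch assumes the common textbook recursion with three allowed predecessors and no slope constraints; the claim does not specify the warping path constraints, and the names `dtw_distance` and `squared_distance` are hypothetical.

```python
def squared_distance(t_frame, r_frame):
    """Local distance: sum of squared feature differences."""
    return sum((t - r) ** 2 for t, r in zip(t_frame, r_frame))


def dtw_distance(utterance, reference, local=squared_distance):
    """Minimum accumulated local distance over monotonic alignments.

    utterance -- list of N frames, each a list of features
    reference -- list of J frames, each a list of features
    local     -- local distance between one utterance and one reference frame
    """
    n_frames, j_frames = len(utterance), len(reference)
    inf = float("inf")
    # acc[n][j] holds the best accumulated distance aligning the first
    # n utterance frames with the first j reference frames.
    acc = [[inf] * (j_frames + 1) for _ in range(n_frames + 1)]
    acc[0][0] = 0.0
    for n in range(1, n_frames + 1):
        for j in range(1, j_frames + 1):
            d = local(utterance[n - 1], reference[j - 1])
            acc[n][j] = d + min(acc[n - 1][j],      # utterance frame repeated
                                acc[n][j - 1],      # reference frame repeated
                                acc[n - 1][j - 1])  # both advance
    return acc[n_frames][j_frames]


if __name__ == "__main__":
    print(dtw_distance([[0.0], [1.0], [1.0], [2.0]], [[0.0], [1.0], [2.0]]))
```

Any SNR-dependent modification of the local distance can be supplied through the `local` argument, so the accumulation itself stays unchanged.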



WO 91/11696
PCT/US91/00053

[Drawing: block diagram. Legible block labels: FILTER; VOICE PROCESSOR (110); NOISE ANALYSIS; TEMPLATE MEANS (160); TEMPORARY FEATURE STORAGE; COMPARISON MEANS (170); CSM EXTRACTOR; CSM FEATURES. Legible figure label: FIG. 2.]



[FIG. 3: flowchart]
START
310  Receive input utterance frames
320  Compute autocorrelation of frames
330  Compute interpolating correlation
340  Find coefficients of polynomial
350  Find roots of polynomial
360  Find CSM frequencies and amplitudes



[FIG. 4: block diagram of the noise analyzer (120). Legible labels: NOISE; AVERAGING MEANS; PROCESSOR; output SNR(f).]



[Drawing sheet 3/5. FIG. 5: block diagram of the template means (160). Legible labels: VOICE PROCESSOR (110); TRAINING; STORAGE MEANS (162); ENDPOINT ALIGNER (164).]



[FIG. 7: block diagram of the comparison means (170). Legible labels: NOISE ANALYZER (120); TEMPORARY STORAGE (150); COMPUTER (174); CONTROLLER (172); TEMPLATE MEANS (160).]



[FIG. 6: drawing with legible labels ENDPOINTS IN QUIET BACKGROUND and ENDPOINTS IN NOISE.]



[Drawing sheet 5/5. FIG. 8: flowchart]
START
810  Get features of input utterance
820  Get features of reference sample
830  Compute the local distance and modify based on SNR
840  Compute accumulated sum of distances
850  Find the minimum accumulated distance
     Select minimum distance
880  Recognize command word
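
The final selection steps of the FIG. 8 flow (find the minimum accumulated distance, then recognize the command word) may be sketched as follows. This is a minimal sketch under stated assumptions: the names `recognize` and `framewise_distance`, the dictionary of templates, and the example words are hypothetical, and the simple frame-by-frame distance stands in for the accumulated, SNR-modified distance of steps 830 and 840.

```python
def framewise_distance(utterance, reference):
    """Stand-in distance: squared differences summed frame by frame.

    Assumes both sequences hold the same number of frames; a real system
    would use an accumulated distance that tolerates timing differences.
    """
    return sum((t - r) ** 2
               for t_frame, r_frame in zip(utterance, reference)
               for t, r in zip(t_frame, r_frame))


def recognize(utterance, templates, distance_fn=framewise_distance):
    """Return (word, distance) for the reference sample nearest the utterance.

    templates -- mapping from command word to its reference feature frames
    """
    best_word, best_distance = None, float("inf")
    for word, reference in templates.items():
        distance = distance_fn(utterance, reference)
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word, best_distance


if __name__ == "__main__":
    templates = {"siren": [[1.0], [2.0]], "lights": [[5.0], [5.0]]}
    print(recognize([[1.0], [2.5]], templates))
```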



INTERNATIONAL SEARCH REPORT

International Application No. PCT/US91/00053

I. CLASSIFICATION OF SUBJECT MATTER (if several classification symbols apply, indicate all)
According to International Patent Classification (IPC) or to both National Classification and IPC

IPC (5): G01L 7/08
U.S. Cl.: 381/43

II. FIELDS SEARCHED

Minimum Documentation Searched
Classification Symbols: 381/41, 46, 110 364/513.5 367/198

Documentation Searched other than Minimum Documentation
to the Extent that such Documents are Included in the Fields Searched



III. DOCUMENTS CONSIDERED TO BE RELEVANT

Category   Citation of Document, with indication, where         Relevant to
           appropriate, of the relevant passages                Claim No.

Y          US, A, 4,829,578 (Roberts), 9 May 1989.              1-18, 20
Y          US, A, 4,852,181 (Morito et al), 25 July 1989.       1-18, 20
Y          US, A, 4,897,878 (Boll et al), 30 January 1990.      1-18, 20
Y,P        US, A, 4,918,732 (Gerson et al), 17 April 1990.      1-18, 20
Y,P        US, A, 4,933,973 (Porter), 12 June 1990.             1-18, 20
Y          UK, A, 2,137,791 (Bridle et al.), 10 October 1984.   1-18, 20



* Special categories of cited documents:

"A" document defining the general state of the art which is not
    considered to be of particular relevance
"E" earlier document but published on or after the international
    filing date
"L" document which may throw doubts on priority claim(s) or
    which is cited to establish the publication date of another
    citation or other special reason (as specified)
"O" document referring to an oral disclosure, use, exhibition or
    other means
"P" document published prior to the international filing date but
    later than the priority date claimed
"T" later document published after the international filing date
    or priority date and not in conflict with the application but
    cited to understand the principle or theory underlying the
    invention
"X" document of particular relevance; the claimed invention
    cannot be considered novel or cannot be considered to
    involve an inventive step
"Y" document of particular relevance; the claimed invention
    cannot be considered to involve an inventive step when the
    document is combined with one or more other such documents,
    such combination being obvious to a person skilled
    in the art
"&" document member of the same patent family



IV. CERTIFICATION

Date of the Actual Completion of the International Search: 15 March 1991
Date of Mailing of this International Search Report: 18 APR
International Searching Authority: ISA/US
Signature of Authorized Officer: John A. Merecki

Form PCT/ISA/210 (second sheet) (May 1986)



International Application No. PCT/US91/00053

FURTHER INFORMATION CONTINUED FROM THE SECOND SHEET

V. OBSERVATIONS WHERE CERTAIN CLAIMS WERE FOUND UNSEARCHABLE

This international search report has not been established in respect of certain claims under Article 17(2)(a) for the following reasons:

1. □ Claim numbers    , because they relate to subject matter not required to be searched by this Authority, namely:

2. □ Claim numbers    , because they relate to parts of the international application that do not comply with the prescribed requirements to such an extent that no meaningful international search can be carried out, specifically:

3. ☒ Claim numbers 19, because they are dependent claims not drafted in accordance with the second and third sentences of PCT Rule 6.4(a).

VI. □ OBSERVATIONS WHERE UNITY OF INVENTION IS LACKING

This International Searching Authority found multiple inventions in this international application as follows:

1. □ As all required additional search fees were timely paid by the applicant, this international search report covers all searchable claims of the international application.

2. □ As only some of the required additional search fees were timely paid by the applicant, this international search report covers only those claims of the international application for which fees were paid, specifically claims:

3. □ No required additional search fees were timely paid by the applicant. Consequently, this international search report is restricted to the invention first mentioned in the claims; it is covered by claim numbers:

4. □ As all searchable claims could be searched without effort justifying an additional fee, the International Searching Authority did not invite payment of any additional fee.

Remark on Protest

□ The additional search fees were accompanied by applicant's protest.
□ No protest accompanied the payment of additional search fees.

Form PCT/ISA/210 (supplemental sheet (2)) (Rev. 4-90)



